Abstract

Most syntax-based metrics obtain the similarity by comparing sub-structures extracted from the trees of the hypothesis and the reference. These sub-structures are defined manually and, because their length is limited, cannot express all the information in the trees. In addition, the overlapping parts of these sub-structures are computed repeatedly. To avoid these problems, we propose a novel automatic evaluation metric based on a dependency parsing model, with no need to define sub-structures manually. First, we train a dependency parsing model on the reference dependency tree. Then we generate the hypothesis dependency tree and the corresponding probability with this model. The quality of the hypothesis can be judged by this probability. To capture lexical similarity, we also introduce the unigram F-score into the new metric. Experimental results show that the new metric achieves state-of-the-art performance at the system level and is comparable with METEOR at the sentence level.

1 Introduction 

Automatic machine translation (MT) evaluation not only evaluates the performance of MT systems, but also accelerates the development of MT systems (Och, 2003). According to the type of information employed, automatic MT evaluation metrics can be classified into three categories: lexicon-based metrics, syntax-based metrics and semantic-based metrics.

Most syntax-based evaluation metrics obtain the similarity between reference and hypothesis by comparing the sub-structures of the trees of reference and hypothesis, such as HWCM (Liu and Gildea, 2005) and the LFG-based metric (Owczarzak et al., 2007). HWCM uses the headword chains extracted from the dependency tree, while the LFG-based metric uses the Lexical-Functional Grammar dependency tree. Some syntax-based metrics calculate the similarity between the sub-structures of the reference tree and the string of the hypothesis, such as BLEUÂTRE (Mehay and Brew, 2007) and RED (Yu et al., 2014). The sub-structures in these metrics are defined manually and, because their length is limited, cannot express all the information in the trees. In addition, the overlapping parts of these sub-structures are computed repeatedly.


To avoid the above defects, we propose a new metric from the viewpoint of dependency tree generation. The new metric does not require manually defined sub-structures. We train a dependency parsing model on the reference dependency tree. With this model, we obtain the dependency tree of the hypothesis and the corresponding probability, which is also the score of the dependency parsing model. The syntactic similarity between the hypothesis and the reference can be judged by this score. To capture lexical similarity, we also introduce the unigram F-score into the new metric. The experimental results show that the new metric achieves state-of-the-art performance in system level evaluation, and a correlation comparable with METEOR in sentence level evaluation.


The remainder of this paper is organized as follows. Section 2 describes the maximum-entropy-based dependency parsing model. Section 3 presents the new MT evaluation metric based on the dependency parsing model. Section 4 gives the experiment results. Conclusions and future work are discussed in Section 5.










2 Maximum-entropy-based Dependency 
Parsing Model 


Shift-reduce algorithm is used in the dependency parsing model. In the shift-reduce algorithm, the input sentence is scanned from left to right. In each step, one of the following two actions is selected: shift the current word onto the stack, or reduce the two (or more) items on the top of the stack to one item.

Generally, the reduce action includes two sub-actions, reduce_L and reduce_R. reduce_L means that the left item is considered as the head after reducing, and reduce_R means that the right item is considered as the head after reducing. Formally, the transition state in the shift-reduce parser can be represented as a tuple <S, Q, A>. S is a stack. Q is a sequence of unprocessed words. A is the already-built set of dependency arcs, which is part of the dependency tree in the current state. In each step, one of the following three actions is selected.

• shift: shift the head word in the queue Q onto the stack S.

• reduce_L: merge the top two items (s_t and s_{t-1}) in S into s_t, t >= 2. s_t is considered as the head, and the left arc (s_t, s_{t-1}) is added to the set A.

• reduce_R: merge the top two items (s_t and s_{t-1}) in S into s_{t-1}, t >= 2. s_{t-1} is considered as the head, and the right arc (s_{t-1}, s_t) is added to the set A.

In the traditional shift-reduce decoder algorithm, the next action can be predicted by Formula (1) when the state of the dependency parser is s. In Formula (1), action = {shift, reduce_L, reduce_R}, and score_act(T, s) is the score of action T when the current state is s.

$$T(s) = \arg\max_{T \in action} score_{act}(T, s) \qquad (1)$$

We use classification to decide which action should be chosen in the transition sequence. We combine the action and the corresponding context into a training example, which describes which action should be chosen in a certain context. The context is represented as a series of features. The feature templates used in this paper are the same as those used in Huang et al. (2009).

We use maximum entropy as the classification method to train on the examples and obtain model_ME. When calculating the score of a transition action, we use Formula (2).

$$score_{act}(T', s) = \sum_i \lambda_i f_i(T', s) \qquad (2)$$

f_i(T', s) is the i-th feature when the current state is s and the transition action is T'. λ_i is the weight of the i-th feature. In the shift-reduce algorithm, there are three kinds of actions in each transition step. The probability that the scores of all three actions are zero is very low, because the feature templates include the POS (Part-of-Speech) of the current word and the POS of the two words before the current word. If model_ME chooses two kinds of actions, the score of the third action is zero. To avoid the zero score, we use the normalization method in Formula (3). P_act(T', s) is the normalized probability of the chosen action T' when the current state is s. z is the constant for normalization. set(s) in Formula (4) is the set of all possible actions when the current state is s.

$$P_{act}(T', s) = \frac{1}{z} \cdot \exp\Big(\sum_i \lambda_i f_i(T', s)\Big) \qquad (3)$$

$$z = \sum_{T' \in set(s)} \exp\Big(\sum_i \lambda_i f_i(T', s)\Big) \qquad (4)$$
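The normalization in Formulas (2) to (4) can be illustrated with a minimal sketch. It assumes binary features keyed by (context feature, action) pairs; the function name, the weight dictionary and the feature strings are hypothetical and are not taken from the authors' implementation.

```python
import math

ACTIONS = ("shift", "reduce_L", "reduce_R")


def action_probabilities(weights, context, actions=ACTIONS):
    """Return P_act(T', s) for every action T' in `actions`.

    weights: dict mapping (context_feature, action) -> lambda_i (assumed layout).
    context: list of context feature strings active in state s.
    """
    # Formula (2): raw score is the sum of the weights of the active features.
    scores = {
        action: sum(weights.get((feature, action), 0.0) for feature in context)
        for action in actions
    }
    # Subtracting the maximum score keeps exp() stable and does not change
    # the normalized result.
    top = max(scores.values())
    exps = {action: math.exp(score - top) for action, score in scores.items()}
    z = sum(exps.values())  # Formula (4)
    return {action: value / z for action, value in exps.items()}  # Formula (3)


# An action with raw score 0 still receives a non-zero probability.
probs = action_probabilities({("s0t=NN", "shift"): 1.2}, ["s0t=NN"])
```

When no feature fires for any action, every raw score is zero and the three actions receive equal probability, which is the sparse-data behaviour discussed in Section 3.1.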


The beam search algorithm (Zhang and Clark, 2008) is used in the shift-reduce decoder algorithm. For a sentence x, we can get many dependency trees, and we use gen(x) to represent the set of these dependency trees. The best one can then be obtained by Formula (5). actset(y) represents the set of all the actions taken when generating dependency tree y, and s_{T'} is the state in which action T' is taken.

$$tree(x) = \arg\max_{y \in gen(x)} \sum_{T' \in actset(y)} \log(P_{act}(T', s_{T'})) \qquad (5)$$

model_ME is trained with data that contain the information from the process of dependency parsing, and it is used to parse a sentence. We therefore call the trained model model_ME the dependency parsing model. The score of the dependency parsing model is defined in Formula (6).

$$Score(x) = \sum_{T' \in actset(tree(x))} \log(P_{act}(T', s_{T'})) \qquad (6)$$
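The transition system and the sequence score of this section can be sketched as follows. This is an illustrative replay of a given action sequence that follows the formal bulleted definitions of shift, reduce_L and reduce_R; it is not a parser and not the authors' code. Words are represented by their positions, and the names `replay` and `sequence_score` are hypothetical.

```python
import math


def replay(words, actions):
    """Apply a sequence of transitions and return (stack, arcs).

    Arcs are (head_index, dependent_index) pairs over word positions.
    """
    stack = []
    queue = list(range(len(words)))
    arcs = set()
    for action in actions:
        if action == "shift":
            # Move the first unprocessed word onto the stack.
            stack.append(queue.pop(0))
        elif action in ("reduce_L", "reduce_R"):
            top = stack.pop()      # s_t
            below = stack.pop()    # s_{t-1}
            if action == "reduce_L":
                arcs.add((top, below))   # s_t becomes the head
                stack.append(top)
            else:
                arcs.add((below, top))   # s_{t-1} becomes the head
                stack.append(below)
        else:
            raise ValueError(f"unknown action: {action}")
    return stack, arcs


def sequence_score(action_probs):
    """Formula (6): sum of the log-probabilities of the chosen actions."""
    return sum(math.log(p) for p in action_probs)


words = ["my", "objective", "is"]
actions = ["shift", "shift", "reduce_L", "shift", "reduce_L"]
stack, arcs = replay(words, actions)
```

For this three-word example the sequence contains three shift actions and two reduce actions, and a single root item remains on the stack.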





3 Dependency-parsing-model-based MT
Evaluation Metric

3.1 Training of Dependency Parsing Model 

To train the dependency parsing model, we first need the reference dependency tree. The reference dependency tree can be generated by open-source tools or annotated manually. We use the Stanford tool^1 to generate the reference dependency tree. The reference dependency tree is then used as the training corpus from which features are extracted, according to the feature templates defined in Huang et al. (2009). A training example is obtained by combining the features with the action in the shift-reduce algorithm. The format of the training example is shown in Table 1. We train on the extracted examples using maximum entropy and obtain a dependency parsing model. Following the method introduced in Section 2, we parse the hypothesis with this dependency parsing model and obtain Score(hyp), the score of the dependency parsing model for hypothesis hyp, as in Formula (6).

Action   Features (context)
SHIFT    s0w-s0t=Economia NNP s0w=.
RIGHT    s0w-s0t=and CC s0w=and.
LEFT     s0w-s0t=link VB s0w=link.

Table 1: The format of training example. s0w represents the word on the top of the stack. s0t represents the POS of the top word in the stack.

We train a dependency parsing model for each sentence separately. That is, the reference dependency tree of sentence i is used only to train the dependency parsing model for the hypothesis of sentence i. We also tried two other methods: using all the reference dependency trees to train the model for each hypothesis, and adding a background corpus to the reference dependency tree to train the model for each hypothesis. In both of these methods, we give a higher weight to the dependency tree of sentence i when training the model for hypothesis i. However, both perform worse than using only the reference dependency tree of sentence i when training the model for hypothesis i.

^1 http://nlp.stanford.edu/software/stanford-dependencies.shtml

The dependency parsing model is trained with a maximum entropy model, which ensures smoothness while satisfying all of the constraints. When data are sparse, all the features of all the actions in a state may be zero. According to Formula (3), the probabilities of all the actions are then equal for this state. Sometimes none of the words in the hypothesis appears in the reference, but the POS of some words may appear in the reference. The dependency parsing model can differentiate this case, because POS is used in the feature templates. Table 2 gives a reference, two hypotheses and the corresponding POS sequences of the three sentences. None of the words in hyp1 or hyp2 appears in the reference, but the POS of some words does. According to the dependency parsing model score defined in Formula (6), we get Score(hyp1) = -4.46 and Score(hyp2) = -5.87. From these two scores, we can conclude that hyp1 is better than hyp2, which is indeed the case.

3.2 Normalization of the Dependency
Parsing Model Score

A transition sequence is obtained in the process of generating the dependency tree according to the shift-reduce algorithm. Each word in the sentence is pushed onto the stack once, and each word except the root node is popped from the stack once for reduction. Therefore, there are n shift actions and n - 1 reduce actions, 2n - 1 actions in all, where n is the length of the sentence. The length of the transition sequence is thus 2n - 1. The score of the dependency parsing model is the sum of the logarithms of the transition actions' probabilities, as in Formula (6). Because each logarithm is negative, the sum penalizes long sentences. Some sentences can achieve high scores because of a shorter length and not because of higher quality. Therefore, we need to normalize the score of the dependency parsing model, as in Formula (7). hyp is a hypothesis, and n is the length of hyp. Score(hyp) is defined in Formula (6). The normalized score of the Dependency Parsing Model is named DPM, which is a value between 0 and 1.

$$DPM = e^{\frac{Score(hyp)}{2n - 1}} \qquad (7)$$
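A minimal sketch of the length normalization follows. It assumes that the normalization takes the form exp(Score(hyp) / (2n - 1)), which is consistent with the surviving denominator 2n - 1 and with the stated range between 0 and 1, but this form is an assumption rather than a confirmed transcription of Formula (7). The function name `dpm` is hypothetical, and counting the final punctuation mark as a token (n = 6 for both hypotheses in Table 2) is also an assumption.

```python
import math


def dpm(score, n):
    """Length-normalized parsing score.

    score: sum of log-probabilities of the transition actions (non-positive).
    n: number of tokens in the hypothesis, so the sequence has 2n - 1 actions.
    """
    if n < 1:
        raise ValueError("hypothesis must contain at least one token")
    # Dividing by the number of actions gives the mean log-probability;
    # exponentiating maps it back to a value in (0, 1].
    return math.exp(score / (2 * n - 1))


# Scores reported for the two hypotheses of Table 2.
dpm_hyp1 = dpm(-4.46, 6)
dpm_hyp2 = dpm(-5.87, 6)
```

Under this assumed form, DPM is the geometric mean of the action probabilities, so two hypotheses of different length are compared on a per-action basis.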


















        word sequence                              POS sequence
ref     my objective is to discover the truth .    PRP NN VBZ TO VB DT NN .
hyp1    our goal was finding fact .                PRP NN VBZ VBG NN .
hyp2    was finding our goal fact .                VBZ VBG PRP NN NN .

Table 2: An example for the case that none of the words in hyp1 or hyp2 appears in the reference but the POS of some words appear in the reference.


3.3 Lexical Similarity 

Dependency parsing model mainly evaluates the 
syntax structure similarity between the reference 
and the hypothesis. Besides the syntax structure, 
another important factor is the lexical similarity. 
Therefore, unigram F-score is used to represent 
the lexical similarity in our metric. 

The F-score can be calculated by Formula (8). α is a decimal between 0 and 1, which balances the effects of precision and recall. P means precision and R means recall.

$$F\text{-}score = \frac{P \times R}{\alpha \times P + (1 - \alpha) \times R} \qquad (8)$$
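Formula (8) is a weighted harmonic mean of precision and recall and can be written as a short function. The function name and the guard for a zero denominator are illustrative choices, not part of the paper.

```python
def f_score(precision, recall, alpha):
    """Weighted harmonic mean of precision and recall, as in Formula (8)."""
    denominator = alpha * precision + (1.0 - alpha) * recall
    if denominator == 0.0:
        # Both precision and recall are zero (assuming 0 < alpha < 1).
        return 0.0
    return precision * recall / denominator


# With alpha = 0.85 (the value listed in Table 5), recall has far more
# influence on the result than precision.
recall_heavy = f_score(0.9, 0.3, 0.85)
precision_heavy = f_score(0.3, 0.9, 0.85)
```

With α = 0.5 the expression reduces to the usual balanced F1 score, 2PR / (P + R).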


Many automatic evaluation metrics can only find exact matches between the reference and the hypothesis, and the information provided by the limited number of references is not sufficient. Some evaluation metrics, such as TERp (Snover et al., 2009) and METEOR (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009; Denkowski and Lavie, 2014), introduce extra resources to expand the reference information. We also introduce some extra resources when calculating the F-score, such as stem (Porter, 2001), synonym^2 and paraphrase. First, we obtain the alignment with Meteor Aligner (Denkowski and Lavie, 2011), in which exact, stem, synonym and paraphrase matches are all considered. Then we find the matched words using the alignment, and every matched word corresponds to a match module type (exact, stem, synonym or paraphrase). Different match module types have different match weights, which are represented as w_exact, w_stem, w_synonym and w_paraphrase.


The words within a sentence can be classified into content words and function words. The effects of the two kinds of words are different and they should not have the same matching score, so we introduce a parameter w_f to distinguish them.


^2 http://wordnet.princeton.edu


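After introducing extra resources, the precision P and recall R can be calculated by Formula (9) and Formula (10) respectively. In both formulas, w_f weights function words and (1 - w_f) weights content words, in the numerator as well as in the denominator.

$$P = \frac{\sum_i m_i \cdot (w_f \cdot f_h(i) + (1 - w_f) \cdot c_h(i))}{w_f \cdot num_f(h) + (1 - w_f) \cdot num_c(h)} \qquad (9)$$

$$R = \frac{\sum_i m_i \cdot (w_f \cdot f_r(i) + (1 - w_f) \cdot c_r(i))}{w_f \cdot num_f(r) + (1 - w_f) \cdot num_c(r)} \qquad (10)$$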

In Formula (9), i is the index of a matched unigram, 0 < i <= n, and n is the number of matched unigrams. m_i is the weight of the match module to which the i-th matched word belongs. w_f is the weight of function words. num_f(h) is the number of function words in the hypothesis, and num_c(h) is the number of content words in the hypothesis. f_h(i) indicates whether the i-th matched unigram in the hypothesis is a function word.

$$f_h(i) = \begin{cases} 1 & \text{if function word} \\ 0 & \text{if not function word} \end{cases}$$

c_h(i) indicates whether the i-th matched unigram in the hypothesis is a content word.

$$c_h(i) = \begin{cases} 1 & \text{if content word} \\ 0 & \text{if not content word} \end{cases}$$

In Formula (10), i, m_i and w_f have the same meanings as in Formula (9). num_f(r) and num_c(r) are the numbers of function words and content words respectively in the reference. f_r(i) indicates whether the i-th matched word in the reference is a function word.

$$f_r(i) = \begin{cases} 1 & \text{if function word} \\ 0 & \text{if not function word} \end{cases}$$

c_r(i) indicates whether the i-th matched unigram in the reference is a content word.

$$c_r(i) = \begin{cases} 1 & \text{if content word} \\ 0 & \text{if not content word} \end{cases}$$
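The weighted precision and recall can be sketched with one helper that is called once per side (hypothesis for P, reference for R). The sketch assumes that w_f weights function words and (1 - w_f) weights content words in both numerator and denominator, which follows the stated definition of w_f; the match weights are the values listed in Table 5. The alignment itself (the Meteor Aligner step) is not reproduced, and the input format and function name are hypothetical.

```python
# Match weights per module, as listed in Table 5.
MATCH_WEIGHTS = {"exact": 1.0, "stem": 0.6, "synonym": 0.8, "paraphrase": 0.6}


def weighted_ratio(matches, num_function, num_content, w_f):
    """Weighted precision (hypothesis side) or recall (reference side).

    matches: list of (module, is_function_word) for each matched unigram.
    num_function, num_content: word counts on the side being scored.
    """
    # f(i) and c(i) are complementary indicators, so each matched word
    # contributes m_i * w_f if it is a function word, else m_i * (1 - w_f).
    numerator = sum(
        MATCH_WEIGHTS[module] * (w_f if is_function else 1.0 - w_f)
        for module, is_function in matches
    )
    denominator = w_f * num_function + (1.0 - w_f) * num_content
    return numerator / denominator if denominator else 0.0


# Hypothetical hypothesis: 2 function words and 3 content words, of which
# one function word and two content words are matched.
matches = [("exact", True), ("exact", False), ("stem", False)]
precision = weighted_ratio(matches, num_function=2, num_content=3, w_f=0.25)
```

With this pairing of weights, a side whose words are all matched exactly scores 1, so precision and recall stay within the range 0 to 1.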




























Parameter       Meaning
α               balance the effects of precision and recall
w_f             differentiate the effects of function word and content word
w_exact         match weight for match module type exact
w_stem          match weight for match module type stem
w_synonym       match weight for match module type synonym
w_paraphrase    match weight for match module type paraphrase

Table 3: The meanings of parameters in DPMF.


data       cs-en  de-en  es-en  fr-en  ru-en  hi-en
WMT2012    6      16     12     15     -      -
WMT2013    12     23     17     19     23     -
WMT2014    5      13     -      8      13     9

Table 4: The number of translation systems for each language pair on WMT 2012, WMT 2013 and WMT 2014. cs-en means Czech to English, de-en means German to English, es-en means Spanish to English, fr-en means French to English, ru-en means Russian to English, hi-en means Hindi to English.


language pair   α      w_f    w_exact  w_stem  w_synonym  w_paraphrase
*-en            0.85   0.25   1.0      0.6     0.8        0.6

Table 5: Parameter values of DPMF. *-en represents all the language pairs with English as target language.


3.4 Final Score of DPMF 

After obtaining the score of the dependency parsing model and the lexical similarity, we can calculate the final score of the new metric. Because we use both the Dependency Parsing Model and the F-score, we name the score DPMF. As shown in Formula (11), DPMF evaluates similarity in terms of both syntax and lexicon.

$$DPMF = DPM \times F\text{-}score \qquad (11)$$

The system level score is the average score of all the sentences. Several parameters are involved in calculating the F-score. The meaning of each parameter is listed in Table 3.
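The combination in Formula (11) and the system level averaging can be sketched in a few lines. The function names are hypothetical, and the per-sentence DPM and F-score values in the example are made-up numbers used only to show the arithmetic.

```python
def dpmf(dpm_score, f_score_value):
    """Formula (11): product of the syntactic and the lexical score."""
    return dpm_score * f_score_value


def system_score(sentence_scores):
    """System level score: arithmetic mean over all sentence scores."""
    if not sentence_scores:
        raise ValueError("at least one sentence score is required")
    return sum(sentence_scores) / len(sentence_scores)


# Illustrative (dpm, f_score) pairs for three sentences of one system.
sentence_pairs = [(0.5, 0.8), (0.6, 0.5), (0.9, 1.0)]
system_level = system_score([dpmf(d, f) for d, f in sentence_pairs])
```

Because the two components are multiplied, a hypothesis needs both a plausible structure and good lexical overlap to score well; a low value in either component pulls the product down.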

4 Experiment 

To verify the effectiveness of DPM and DPMF, we carry out experiments at both the system level and the sentence level.^3

4.1 Data

The data used in the experiment are WMT 2012, WMT 2013 and WMT 2014. The language pairs are Czech-to-English, German-to-English, Spanish-to-English, French-to-English, Russian-to-English and Hindi-to-English. The number of translation systems for each language pair is shown in Table 4.

All the parameters of DPMF are also included in METEOR, and METEOR has tuned these parameters for better performance. We therefore use the same parameter values as METEOR as empirical values in DPMF and do not tune the parameters again. The parameter values used in the experiment are listed in Table 5.

^3 Interested readers can find the source code of DPM and DPMF at https://github.com/YuHui0117/AMTE/tree/master/DPMF.

4.2 System Level Correlation

To evaluate the correlation with human judgments, Spearman's rank correlation coefficient ρ is used at the system level. ρ is calculated using Formula (12).

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \qquad (12)$$

d_i is the difference between the human rank and the metric's rank for system i. n is the number of systems.
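Formula (12) can be computed directly from two rank lists. The sketch below assumes that ranks are already assigned and contain no ties, since the closed form is exact only in that case; the function name is hypothetical.

```python
def spearman_rho(human_ranks, metric_ranks):
    """Spearman's rank correlation, as in Formula (12), for untied ranks."""
    n = len(human_ranks)
    if n != len(metric_ranks) or n < 2:
        raise ValueError("need two rank lists of equal length >= 2")
    # d_i is the rank difference for system i.
    squared_diffs = sum((h - m) ** 2 for h, m in zip(human_ranks, metric_ranks))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Five systems; the metric swaps the two best systems.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 3, 4, 5])
```

A value of 1 means the metric ranks the systems exactly as the human judges do, and -1 means the ranking is fully reversed.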

In the experiment, we give the correlations of DPM and DPMF respectively. For comparison, the baseline metrics are the widely-used metrics,






























metrics                 cs-en  de-en  es-en  fr-en  avg
TER                     .886   .624   .916   .821   .812
BLEU                    .886   .671   .874   .811   .811
METEOR                  .657   .885   .951   .843   .834
•SEMPOS                 .940   .920   .940   .800   .900
DPM                     .943   .735   .888   .821   .847
DPMF                    .943   .909   .951   .850   .913

(a) System level correlations on WMT2012.

metrics                 cs-en  de-en  es-en  fr-en  ru-en  avg
TER                     .800   .833   .825   .951   .581   .798
BLEU                    .946   .851   .902   .989   .698   .877
•METEOR                 .964   .961   .979   .984   .789   .935
DPM                     .945   .880   .937   .951   .800   .903
DPMF                    .991   .975   .993   .984   .849   .958

(b) System level correlations on WMT2013.

metrics                 cs-en  de-en  fr-en  hi-en  ru-en  avg
TER                     .976   .775   .952   .618   .809   .826
BLEU                    .909   .832   .952   .956   .789   .888
METEOR                  .980   .927   .975   .457   .805   .829
•*DISCOTK-PARTY-TUNED   .975   .943   .977   .956   .870   .944
*LAYERED                .941   .893   .973   .976   .854   .927
*DISCOTK-PARTY          .983   .921   .970   .862   .856   .918
*UPC-STOUT              .948   .915   .968   .898   .837   .913
VERTA-W                 .934   .867   .959   .920   .848   .906
DPM                     .988   .817   .946   .934   .858   .909
DPMF                    .999   .920   .967   .882   .832   .920

(c) System level correlations on WMT2014.

Table 6: System level correlations on WMT 2012, WMT 2013 and WMT 2014. The value in bold is the best result in each column. avg stands for the average result of all the language pairs for each metric on WMT 2012, WMT 2013 or WMT 2014. Metrics with * are the hybrid metrics. Metrics with • are the best performance metrics in each data set.


BLEU^4, TER^5 and METEOR^6. In addition, we also give the correlations of the metrics with the best average performance according to the published results of WMT 2012, WMT 2013 and WMT 2014. For WMT 2012 and WMT 2013, the metrics with the best average performance are SEMPOS (Macháček and Bojar, 2011) and METEOR respectively. For WMT 2014, the top four metrics are DISCOTK-PARTY-TUNED (Joty et al., 2014), LAYERED (Gautam and Bhattacharyya, 2014), DISCOTK-PARTY (Joty et al., 2014) and UPC-STOUT (Gonzàlez et al., 2014). They are all hybrid metrics^7, which include many kinds of other metrics. For fairness, we also give the result of the single metric with the best average performance, VERTA-W (Comelles and Atserias, 2014).

^4 ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v13a.pl
^5 http://www.cs.umd.edu/~snover/tercom
^6 http://www.cs.cmu.edu/~alavie/METEOR/download/meteor-1.4.tgz

System level correlations are shown in Table 6. According to Table 6, DPM gets higher correlations than BLEU and TER on the three data sets. DPM also gets higher correlations than METEOR on WMT 2012 and WMT 2014. The experiment results show that DPM can effectively evaluate the
^7 Hybrid metrics directly use the scores of many kinds of metrics, such as BLEU, TER, METEOR and some syntax-based metrics, so we consider them hybrid metrics. Metrics that use different kinds of information (lexical, syntactic and semantic) as features are still considered single metrics, because they do not use the scores of other metrics.





















































Language                cs-en  de-en  es-en  fr-en  avg
BLEU                    .157   .191   .189   .210   .187
METEOR                  .212   .275   .249   .251   .247
•spede07_pP             .212   .278   .265   .260   .254
DPM                     .146   .187   .211   .183   .182
DPMF                    .227   .279   .279   .252   .259

(a) Sentence level correlations on WMT 2012.

Language                cs-en  de-en  es-en  fr-en  ru-en  avg
BLEU                    .199   .220   .259   .224   .162   .213
METEOR                  .265   .293   .324   .264   .239   .277
•SIMPBLEU-RECALL        .260   .318   .387   .303   .234   .301
DPM                     .179   .204   .237   .194   .146   .192
DPMF                    .258   .296   .316   .269   .227   .273

(b) Sentence level correlations on WMT 2013.

Language                cs-en  de-en  fr-en  hi-en  ru-en  avg
BLEU                    .216   .259   .367   .286   .256   .277
METEOR                  .282   .334   .406   .420   .329   .354
BEER                    .284   .337   .417   .438   .333   .362
•*DISCOTK-PARTY-TUNED   .328   .380   .433   .434   .355   .386
DPM                     .182   .224   .331   .301   .243   .256
DPMF                    .283   .332   .404   .426   .324   .354

(c) Sentence level correlations on WMT 2014.

Table 7: Sentence level correlations on WMT 2012, WMT 2013 and WMT 2014. The value in bold is the best result in each column. avg stands for the average result of all the language pairs for each metric on WMT 2012, WMT 2013 or WMT 2014. Metrics with * are the hybrid metrics. Metrics with • are the best performance metrics in each data set.


hypothesis. In order to evaluate the lexical information, we also introduce the F-score into DPM and add some extra linguistic resources to the F-score, so that the lexical similarity between the hypothesis and the reference is evaluated more accurately. After adding the F-score, DPMF improves over DPM on all three data sets (the average correlation rises from .847 to .913 on WMT 2012, from .903 to .958 on WMT 2013 and from .909 to .920 on WMT 2014), so adding the F-score to DPM is effective for evaluating the lexical information. On WMT 2012, WMT 2013 and WMT 2014, DPMF gets higher correlations than METEOR. Compared with the best metric SEMPOS on WMT 2012, DPMF achieves higher correlations on the three language pairs cs-en, es-en and fr-en, and gains 1.3 points over SEMPOS on average. Compared with the best metric METEOR on WMT 2013, DPMF achieves higher correlations on all the language pairs except fr-en, where the two are equal. On average, DPMF gains 2.3 points over METEOR. Compared with the best single metric VERTA-W on WMT 2014, the correlation improvement of DPMF is 1.4 points. DPMF also outperforms the hybrid metrics LAYERED and DISCOTK-PARTY, but more work is needed for DPMF to catch up with the best hybrid metric.

4.3 Sentence Level Correlation 

To evaluate the performance of DPM and DPMF further, we also carry out experiments at the sentence level. At the sentence level, Kendall's τ correlation coefficient is used. τ is calculated using the following equation.

$$\tau = \frac{num\_con\_pairs - num\_dis\_pairs}{num\_con\_pairs + num\_dis\_pairs}$$

num_con_pairs is the number of concordant pairs and num_dis_pairs is the number of discordant pairs.
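The τ computation above can be sketched over pairwise human judgments. The input format (a list of better/worse pairs plus a score per translation) and the function name are assumptions made for illustration. Metric ties are counted as discordant here, which is only one of several conventions and may differ from the one used in the official WMT evaluations.

```python
def kendall_tau(human_pairs, metric_scores):
    """Kendall's tau from pairwise human preferences.

    human_pairs: list of (better_id, worse_id) judged by humans.
    metric_scores: dict mapping translation id -> metric score.
    """
    concordant = 0
    discordant = 0
    for better, worse in human_pairs:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        else:
            # Assumption: a tie in metric scores counts as discordant.
            discordant += 1
    total = concordant + discordant
    if total == 0:
        raise ValueError("no pairs to compare")
    return (concordant - discordant) / total


pairs = [("a", "b"), ("a", "c"), ("b", "c")]
scores = {"a": 0.9, "b": 0.5, "c": 0.7}
tau = kendall_tau(pairs, scores)
```

In this example the metric agrees with the human judges on two of the three pairs, so τ is positive but well below 1.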

In the experiments, we give the results of DPM and DPMF respectively. For comparison, the baseline metrics are the widely-used metrics BLEU and METEOR. In addition, we also give the correlations of the metric with the best average performance according to the published results of WMT 2012, WMT 2013 and WMT 2014. These metrics are spede07_pP on WMT 2012, SIMPBLEU-RECALL on WMT 2013 and DISCOTK-PARTY-TUNED on WMT 2014. Because DISCOTK-PARTY-TUNED is a hybrid metric, we also give the result of the single metric with the best average performance, BEER (Stanojević and Sima’an, 2014).

Sentence level correlations are shown in Table 7. From Table 7 we can see that the performance of DPM is weak, a little lower than that of BLEU. The reason is that DPM mainly considers syntactic structure information. After introducing lexical information (F-score), DPMF achieves a significant improvement over DPM and BLEU. DPMF outperforms METEOR on WMT 2012 and is comparable with METEOR on WMT 2013 and WMT 2014. These results show that DPMF can give an effective evaluation of the hypothesis at the sentence level. Compared with the best metric spede07_pP on WMT 2012, DPMF achieves a comparable correlation.


5 Conclusion and Future Work 


translation quality. 